Ogden's Lemma for Regular Tree Languages 
1 Introduction

Pumping lemmata are elementary tools for the analysis of formal languages.
While they usually cannot be made strong enough to fully capture a class
of languages, it is generally desirable to have pumping lemmata that are as
strong as possible. However, this is counterbalanced by the experience that
strong pumping lemmata may be hard to prove, or, worse, hard to use. This
has been observed, for example, in the study of the output languages of
tree transducers, where other proof techniques, so-called bridge theorems,
make better tools (Engelfriet and Maneth, 2002). The purpose of this squib
is to strengthen the standard pumping lemma for the class of regular tree
languages (Gécseg and Steinby, 1997), without sacrificing its usability, in
the same way as Ogden strengthened the pumping lemma for context-free
string languages (Ogden, 1968).



The paper is structured as follows. Section 2 introduces our notation.
Section 3 presents the main lemma and motivates it using a small, formal
example. Finally, Section 4 contains the proof of the main lemma.



2 Preliminaries 



We assume the reader to be familiar with the standard concepts from the
theory of tree languages. The notation that we use in this paper is mostly
identical to the one used in the survey by Gécseg and Steinby (1997); the
major difference is our way of denoting substitution into contexts.

* I wish to thank Mathias Möhl and an anonymous reviewer for pointing out
errors, and for comments that helped to improve the quality of the
presentation. The work reported in this paper was partially funded by the
German Research Foundation.

We write $\mathbb{N}$ for the set of natural numbers (including zero), and
$[n]$ as an abbreviation for the set $\{ i \in \mathbb{N} \mid 1 \le i \le n \}$.
Given a set $A$, we write $|A|$ for the cardinality of $A$, and $A^*$ for
the set of all strings over $A$.

Let $\Sigma$ be a ranked alphabet. For a tree $t \in T_\Sigma$, we write
$|t|$ to denote the size of $t$, defined as the number of nodes of $t$. A
path in $t$ is a sequence of nodes of $t$ in which each node but the first
one is a child of the node preceding it. Let $\circ$ be a symbol with rank
zero that does not occur in $\Sigma$. Recall that a context over $\Sigma$
is a tree $c$ over $\Sigma \cup \{\circ\}$ in which $\circ$ occurs exactly
once. We call the (leaf) node at which the symbol $\circ$ occurs the hole
of the context. We write $|c|$ to denote the size of the context $c$,
defined as the number of non-hole nodes of $c$. Finally, given a context
$c$ and a tree $t$, we write $c \cdot t$ for the tree obtained by
substituting $t$ into $c$ at its hole. Note that Gécseg and Steinby denote
this tree by $t \cdot c$ or $c(t)$.
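
The notions of context, size, and substitution can be made concrete with a
small sketch. The encoding below is an assumption made purely for
illustration and is not part of the paper: trees are nested Python tuples
whose first component is the node label, the hole is the one-element tuple
`("o",)`, and the names `size`, `plug`, and `power` are hypothetical.

```python
HOLE = ("o",)  # assumed encoding of the hole symbol; "o" must not occur in the alphabet

def size(t):
    """Number of nodes of a tree; the hole of a context is not counted."""
    if t == HOLE:
        return 0
    return 1 + sum(size(child) for child in t[1:])

def plug(c, t):
    """Return c . t, obtained by substituting t at the hole of c."""
    if c == HOLE:
        return t
    return (c[0],) + tuple(plug(child, t) for child in c[1:])

def power(c, n):
    """Return c^n, the n-fold composition of c with itself; c^0 is the empty context."""
    result = HOLE
    for _ in range(n):
        result = plug(c, result)
    return result

a = ("a",)
c = ("f", ("g", HOLE), a)   # the context f(g(o), a)
t = plug(c, a)              # the tree f(g(a), a)
pumped = plug(power(c, 2), a)
```

Plugging a context into a context again yields a context, which is why
`power` can reuse `plug` for composition.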

A subset $L \subseteq T_\Sigma$ is a tree language over $\Sigma$. A tree
language $L$ is regular if there is a finite-state tree automaton that
accepts $L$.

3 Motivation 

To motivate the need for a strong pumping lemma for regular tree languages,
we start with a look at the standard one (Gécseg and Steinby, 1997).^1



Lemma 1 For every regular tree language $L \subseteq T_\Sigma$, there is a
number $p \ge 1$ such that any tree $t \in L$ of size at least $p$ can be
written as $t = c' \cdot c \cdot t'$ in such a way that $|c| \ge 1$,
$|c \cdot t'| \le p$, and $c' \cdot c^n \cdot t' \in L$, for every
$n \in \mathbb{N}$. □

Just like the pumping lemma for context-free string languages, Lemma 1 is
most often used in its contrapositive formulation, which specifies a strategy
for proofs that a language $L \subseteq T_\Sigma$ is not regular: show that,
for all $p \ge 1$, there exists a tree $t \in L$ of size at least $p$ such
that for any decomposition $c' \cdot c \cdot t'$ of $t$ in which $|c| \ge 1$
and $|c \cdot t'| \le p$, there is a number $n \in \mathbb{N}$ such that
$c' \cdot c^n \cdot t' \notin L$. It is helpful to think of a proof according
to this strategy as a game against an imagined adversary, where our objective
is to prove that $L$ is non-regular, and Adversary's objective is to foil
this proof. The game consists of four alternating turns: In the first turn,
Adversary must choose a number $p \ge 1$. In the second turn, we must respond
to this choice by providing a tree $t \in L$ of size at least $p$. In the
third turn, Adversary must choose a decomposition of $t$ into fragments
$c' \cdot c \cdot t'$ such that $|c| \ge 1$ and $|c \cdot t'| \le p$. In the
fourth and final turn, we must provide a number $n \in \mathbb{N}$ such that
$c' \cdot c^n \cdot t' \notin L$. If we are able to do so, we win the game;
otherwise, Adversary wins. We can prove that $L$ is non-regular if we have
a winning strategy for the game.

^1 The lemma given here is in fact slightly stronger than the one given by
Gécseg and Steinby (1997) (Proposition 5.2), and makes pumpability dependent
on the size of a tree, rather than on its height.

Figure 1: Two tree languages that are not regular

Consider the language $L_1 = \{ f(g^n \cdot a, g^n \cdot a) \mid n \ge 1 \}$,
shown schematically in Figure 1a. Using Lemma 1, it is easy to show that
this language is non-regular: we always win by presenting Adversary with the
tree $t = f(g^p \cdot a, g^p \cdot a)$. To see this, notice that in whatever
way Adversary decomposes $t$ into fragments $c' \cdot c \cdot t'$ such that
$|c| \ge 1$, the pumped tree $c' \cdot c^2 \cdot t'$ does not belong to
$L_1$. In particular, if $c$ is rooted at a node that is labelled with $g$,
then the pumped tree violates the constraint that the two branches have the
same length.

Unfortunately, Lemma 1 sometimes is too blunt a tool to show the
non-regularity of a tree language. Consider the language

    $L_2 = \{ f(g^n \cdot h^{m_1} \cdot a, g^n \cdot h^{m_2} \cdot a) \mid n, m_1, m_2 \ge 1 \}$



(see Figure 1b). It is not unreasonable to believe that $L_2$, like $L_1$,
is non-regular, but it is impossible to prove this using Lemma 1. To see
this, notice that Adversary has a winning strategy for $p \ge 2$: for every
tree $t \in L_2$ that we can provide in the second turn of the game,
Adversary can choose any decomposition $c' \cdot c \cdot t'$ in which
$c = h(\circ)$ and $t' = a$. In this case, $|c| \ge 1$, $|c \cdot t'| \le p$,
and both deleting and pumping $c$ yield only valid trees in $L_2$.
Intuitively, we would like to force Adversary to choose a decomposition that
contains a $g$-labelled node, thus transferring our winning strategy for
$L_1$. This is not warranted by Lemma 1, which merely asserts that a
pumpable context does exist somewhere in the tree, but does not allow us to
delimit the exact region. The pumping lemma that we prove in this paper
makes a stronger assertion:

Lemma 2 For every regular tree language $L \subseteq T_\Sigma$, there is a
number $p \ge 1$ such that every tree $t \in L$ in which at least $p$ nodes
are marked as distinguished can be written as $t = c' \cdot c \cdot t'$ such
that at least one node in $c$ is marked, at most $p$ nodes in $c \cdot t'$
are marked, and $c' \cdot c^n \cdot t' \in L$, for all $n \in \mathbb{N}$. □

Note that, in the special case where all nodes are marked, Lemma 2 reduces
to Lemma 1.

Lemma 2 can be seen as the natural correspondent of Ogden's Lemma for
context-free string languages (Ogden, 1968). Its contrapositive corresponds
to the following modified game for tree languages $L$: In the first turn,
Adversary has to choose a number $p \ge 1$. In the second turn, we have to
choose a tree $t \in L$ and mark at least $p$ nodes in $t$. In the third
turn, Adversary has to choose a decomposition $c' \cdot c \cdot t'$ of $t$
in such a way that at least one node in $c$ and at most $p$ nodes in
$c \cdot t'$ are marked. In the fourth and final turn, we have to choose a
number $n \in \mathbb{N}$ such that $c' \cdot c^n \cdot t' \notin L$. In
this modified game, we can implement our idea from above to prove that the
language $L_2$ is non-regular: we can always win the game by presenting
Adversary with the tree $t = f(g^p \cdot h(a), g^p \cdot h(a))$ and marking
all nodes that are labelled with $g$ as distinguished. Then, in whatever way
Adversary decomposes $t$ into segments $c' \cdot c \cdot t'$, the context
$c$ contains at least one node labelled with $g$, and the tree
$c' \cdot c^2 \cdot t'$ does not belong to $L_2$.
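
The winning strategy for $L_2$ in the modified game can be checked
mechanically for small values of $p$. The following is only a finite sanity
check under assumed encodings, not a proof and not code from the paper: trees
are nested tuples, the hole is `("o",)`, exactly the $g$-labelled nodes count
as marked, and all function names are hypothetical. The check answers every
legal move of Adversary with $n = 2$.

```python
HOLE = ("o",)  # assumed encoding of the hole symbol

def plug(c, t):
    """c . t: substitute t at the hole of c."""
    if c == HOLE:
        return t
    return (c[0],) + tuple(plug(child, t) for child in c[1:])

def positions(t, prefix=()):
    """All node addresses of t, as tuples of child indices."""
    yield prefix
    for i, child in enumerate(t[1:]):
        yield from positions(child, prefix + (i,))

def subtree(t, pos):
    for i in pos:
        t = t[1 + i]
    return t

def cut(t, pos):
    """Replace the subtree of t at address pos by the hole."""
    if not pos:
        return HOLE
    kids = list(t[1:])
    kids[pos[0]] = cut(kids[pos[0]], pos[1:])
    return (t[0],) + tuple(kids)

def decompositions(t):
    """All triples (c', c, t') with t = c' . c . t' and c non-empty."""
    for u in positions(t):
        below_u = subtree(t, u)
        for v in positions(below_u):
            if v:  # a non-empty address means c has at least one node
                yield cut(t, u), cut(below_u, v), subtree(below_u, v)

def count_g(t):
    """Number of marked nodes, where exactly the g-labelled nodes are marked."""
    return (t[0] == "g") + sum(count_g(child) for child in t[1:])

def shape(b):
    """Return (n, m) if b = g^n . h^m . a, and None otherwise."""
    n = m = 0
    while len(b) == 2 and b[0] == "g":
        b, n = b[1], n + 1
    while len(b) == 2 and b[0] == "h":
        b, m = b[1], m + 1
    return (n, m) if b == ("a",) else None

def in_L2(t):
    if len(t) != 3 or t[0] != "f":
        return False
    left, right = shape(t[1]), shape(t[2])
    if left is None or right is None:
        return False
    return left[0] == right[0] and min(left[0], left[1], right[1]) >= 1

def we_win(p):
    """Check that pumping with n = 2 defeats every legal move for this p."""
    branch = ("h", ("a",))
    for _ in range(p):
        branch = ("g", branch)
    t = ("f", branch, branch)
    legal = [(c1, c, t1) for c1, c, t1 in decompositions(t)
             if count_g(c) >= 1 and count_g(plug(c, t1)) <= p]
    return bool(legal) and all(
        not in_L2(plug(c1, plug(c, plug(c, t1)))) for c1, c, t1 in legal)
```

Calling `we_win(p)` for a few small values of `p` exercises all
decompositions that Adversary is allowed to choose; the bound on marked
nodes in $c \cdot t'$ is what rules out decompositions rooted at the
$f$-labelled node.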



4 Proof 

Our proof of Lemma 2 builds on the following technical lemma:

Lemma 3 Let $\Sigma$ be a ranked alphabet. For every tree language
$L \subseteq T_\Sigma$ and every $k \ge 1$, there exists a number $p \ge 1$
such that every tree $t \in L$ in which at least $p$ nodes have been marked
as distinguished can be written as $t = c' \cdot c_1 \cdots c_k \cdot t'$ in
such a way that for each $i \in [k]$, the context $c_i$ contains at least
one marked node, and the tree $c_1 \cdots c_k \cdot t'$ contains at most $p$
marked nodes. □
Proof Let $m$ be the maximal rank of any symbol in $\Sigma$. Note that if
$m$ is zero, then each tree over $\Sigma$ has size one, and the lemma
trivially holds with $p = 2$. For the remainder of the proof, assume that
$m \ge 1$. Put $g_\Sigma(n) = \sum_{i=0}^{n} m^i$; note that
$g_\Sigma(n) < g_\Sigma(n+1)$, for all $n \in \mathbb{N}$. We will show that
we can choose $p = g_\Sigma(k)$.
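
For orientation, the bound can be written in closed form. The display below
is not stated in the text; it follows from the geometric series under the
reading that $g_\Sigma(n)$ is the sum of the powers $m^0, \dots, m^n$, and it
also makes the strict monotonicity explicit.

```latex
g_\Sigma(n) \;=\; \sum_{i=0}^{n} m^i \;=\;
\begin{cases}
  n + 1 & \text{if } m = 1, \\[4pt]
  \dfrac{m^{n+1} - 1}{m - 1} & \text{if } m \ge 2,
\end{cases}
\qquad
g_\Sigma(n+1) - g_\Sigma(n) \;=\; m^{n+1} \;\ge\; 1 .
```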

Let $t \in L$ be a tree in which at least one node has been marked as
distinguished. We call a node $u$ of $t$ interesting if it either is marked,
or has at least two children from which there is a path to an interesting
node. It is easy to see from this definition that from every interesting
node, there is a path to a marked node. Let $d(u)$ denote the number of
interesting nodes on the path from the root node of $t$ to $u$, excluding
$u$ itself. We make two observations about the function $d(u)$:

First, there is exactly one interesting node $u$ with $d(u) = 0$. To see
that there is at most one such node, let $u_1$ and $u_2$ be distinct
interesting nodes with $d(u_1) = d(u_2) = n$; then the least common ancestor
$u$ of $u_1$ and $u_2$ is an interesting node with $d(u) \le n - 1$, which
is impossible for $n = 0$. To see that there is at least one such node,
recall that every marked node is interesting.

For the second observation, let $u$ be an interesting node with $d(u) = n$.
The number of interesting descendants $v$ of $u$ with $d(v) = n + 1$ is at
most $m$. To see this, notice that each path from $u$ to $v$ starts with
$u$, continues with some child $u'$ of $u$, and then visits only
non-interesting nodes $w$ until reaching $v$. From each of these
non-interesting nodes $w$, there is at most one path that leads to $v$.
Therefore, the path from $u$ to $v$ is uniquely determined except for the
choice of the child $u'$, which is a choice among at most $m$ alternatives.

Taken together, these observations imply that the number of interesting
nodes $u$ with $d(u) \le k - 1$ is bounded by the value $g_\Sigma(k - 1)$.

Now, let $t$ be a tree in which at least $g_\Sigma(k)$ nodes have been
marked as distinguished. Then there is at least one interesting node $u$
with $d(u) = k$, and hence, at least one path that visits at least $k + 1$
interesting nodes. Choose any path that visits the maximal number of
interesting nodes, and let $\pi$ be a suffix of that path that visits
exactly $k + 1$ interesting nodes, call them $v_1, \dots, v_{k+1}$. We use
$\pi$ to identify a decomposition $c' \cdot c_1 \cdots c_k \cdot t'$ of $t$
as follows: for each $i \in [k]$, choose $v_i$ as the root node of $c_i$,
choose $v_{i+1}$ as the hole of $c_i$, and choose $v_{k+1}$ as the root node
of $t'$. This decomposition satisfies the required properties: To see that
the tree $c_1 \cdots c_k \cdot t'$ contains at most $p$ marked nodes, notice
that, by the choice of $\pi$, no path in $t$ that starts at $v_1$ contains
more than $k + 1$ interesting nodes, and hence the total number of
interesting nodes in the subtree rooted at $v_1$ is bounded by
$g_\Sigma(k) = p$. To see that every context $c_i$, $i \in [k]$, contains at
least one marked node, let $v$ be one of the interesting nodes in $c_i$, and
assume that $v$ is not itself marked. Then $v$ has at least two children
from which there is a path to an interesting, and, ultimately, to a marked
node. At most one of these paths visits $v_{i+1}$; the marked node at the
end of the other path is a node of $c_i$. ■
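
The definitions of interesting nodes and of the function $d$ used in this
proof can be illustrated by a short sketch. The tuple encoding of trees, the
addressing of nodes by tuples of child indices, and the names `interesting`
and `d` are assumptions chosen for illustration only.

```python
def interesting(t, marked):
    """Addresses of the interesting nodes of t, given the set of marked addresses."""
    found = set()

    def visit(node, pos):
        # Number of children whose subtree contains an interesting node.
        hits = sum([visit(child, pos + (i,))
                    for i, child in enumerate(node[1:])])
        if pos in marked or hits >= 2:
            found.add(pos)
        # Report whether the subtree at pos contains an interesting node.
        return pos in marked or hits >= 1

    visit(t, ())
    return found

def d(pos, found):
    """Number of interesting proper ancestors of the node at address pos."""
    return sum(pos[:i] in found for i in range(len(pos)))

a = ("a",)
t = ("f", ("g", a), ("f", a, a))      # the tree f(g(a), f(a, a))
marked = {(0, 0), (1, 0), (1, 1)}     # all three leaves are marked
found = interesting(t, marked)

# Group the interesting nodes by their value of d.
levels = {}
for pos in found:
    levels.setdefault(d(pos, found), []).append(pos)
m = 2                                  # maximal rank in this example
```

In this example the unary $g$-labelled node is not interesting, because only
one of its children leads to a marked node, while both $f$-labelled nodes
are; the dictionary `levels` can be compared against the bound of $m^j$
interesting nodes with $d(u) = j$ that underlies $g_\Sigma$.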

With Lemma 3 at hand, the proof of Lemma 2 is straightforward, and
essentially identical to the proof given for the standard pumping lemma
(Gécseg and Steinby, 1997):



Proof (of Lemma 2) Let $L \subseteq T_\Sigma$ be a regular tree language,
and let $M$ be a tree automaton with state set $Q$ that recognizes $L$. We
will apply Lemma 3 with $k = |Q|$. Let $t \in L$ be a tree in which at least
$p$ nodes are marked as distinguished, where $p$ is the number from Lemma 3.
Then $t$ can be written as $c' \cdot c_1 \cdots c_k \cdot t'$ such that for
each index $i \in [k]$, the context $c_i$ contains at least one marked node,
and the tree $c_1 \cdots c_k \cdot t'$ contains at most $p$ marked nodes.
Note that each context $c_i$, $i \in [k]$, is necessarily non-empty. Since
$M$ has only $k$ states, but there are $k + 1$ root nodes to consider, it
must arrive in the same state at the root nodes of at least two contexts
$c_i$, $i \in [k]$, or at the root node of some context $c_i$, $i \in [k]$,
and the root node of $t'$. A decomposition of $t$ of the required kind is
then obtained by cutting $t$ at these two nodes. ■
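
The pigeonhole step can be sketched for a deterministic bottom-up automaton.
Everything in this example is an assumption made for illustration: the tuple
encoding of trees, the representation of the transition function as a
dictionary `delta`, the function names, and the toy automaton itself, which
has two states and tracks the parity of the number of $g$-labelled nodes.

```python
from itertools import combinations

HOLE = ("o",)  # assumed encoding of the hole symbol

def plug(c, t):
    """c . t: substitute t at the hole of c."""
    if c == HOLE:
        return t
    return (c[0],) + tuple(plug(child, t) for child in c[1:])

def power(c, n):
    """c^n, the n-fold composition of the context c with itself."""
    result = HOLE
    for _ in range(n):
        result = plug(c, result)
    return result

def subtree(t, pos):
    for i in pos:
        t = t[1 + i]
    return t

def cut(t, pos):
    """Replace the subtree of t at address pos by the hole."""
    if not pos:
        return HOLE
    kids = list(t[1:])
    kids[pos[0]] = cut(kids[pos[0]], pos[1:])
    return (t[0],) + tuple(kids)

def state(t, delta):
    """State reached at the root of t by a deterministic bottom-up run."""
    return delta[(t[0],) + tuple(state(child, delta) for child in t[1:])]

def pumpable_cut(t, path, delta):
    """Find two nodes on `path` (listed top-down) with equal states; cut there."""
    for u, v in combinations(path, 2):
        if state(subtree(t, u), delta) == state(subtree(t, v), delta):
            below_u = subtree(t, u)
            return cut(t, u), cut(below_u, v[len(u):]), subtree(t, v)
    return None

# Toy automaton over g/1 and a/0: the state is the parity of the number of g's.
delta = {("a",): 0, ("g", 0): 1, ("g", 1): 0}
t = ("g", ("g", ("g", ("g", ("a",)))))
path = [(), (0,), (0, 0)]              # k + 1 = 3 nodes for k = 2 states
c1, c, t1 = pumpable_cut(t, path, delta)
states = [state(plug(c1, plug(power(c, n), t1)), delta) for n in range(5)]
```

Because the automaton reaches the same state at both cut points, replacing
$c$ by $c^n$ leaves the state at the root unchanged, so acceptance is
preserved for every $n$.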

Note that by choosing $k = m \cdot |Q|$ in this proof, where $m \ge 1$, it
is easy to generalize Lemma 2 as follows:

Lemma 4 For every regular tree language $L \subseteq T_\Sigma$ and every
$m \ge 1$, there is a number $p \ge 1$ such that every tree $t \in L$ in
which at least $p$ nodes are marked as distinguished can be written as
$t = c' \cdot c_1 \cdots c_m \cdot t'$ such that for each $i \in [m]$, at
least one node in $c_i$ is marked, at most $p$ nodes in
$c_1 \cdots c_m \cdot t'$ are marked, and
$c' \cdot c_1^n \cdots c_m^n \cdot t' \in L$, for all $n \in \mathbb{N}$. □